US Environmental Protection Agency (EPA) air quality data, specifically PM2.5 data, has been used as the main source for an in-depth analysis of New York City's (NYC) air quality for the period 2010 to 2020. The EPA collects and stores air quality data on an annual basis, so it was necessary to download a dataset for each year from 2010 to 2020 and compile the individual datasets into a single CSV file. The compiled dataset "full_pm_emissions.csv" is 10.6 MB and contains 53,925 rows × 22 columns.
In order to examine the impact of air quality on New York City residents, the Hospital Inpatient Discharges (SPARCS De-Identified) dataset was used. The Statewide Planning and Research Cooperative System (SPARCS) Inpatient De-identified dataset contains details on patient characteristics and diagnoses, which have been used to examine a correlation between air quality and hospital admissions in NYC. Like the EPA air quality data, SPARCS hospital admissions data is collected and stored on an annual basis. It was therefore necessary to extract individual datasets on hospital admissions from 2010 to 2020 and compile them into a single CSV file. The compiled dataset "full_admissions.csv" is 369 MB and contains 968,724 rows × 34 columns.
Furthermore, an additional dataset of national PM2.5 averages from the EPA has been used to compare the air quality of NYC with the US average air quality from 2010 to 2020. This dataset is 2 KB and contains 21 rows × 5 columns. GeoJSON data containing latitude and longitude coordinates of NYC borough boundaries has also been used in order to highlight the boroughs with the highest PM2.5 concentrations. The GeoJSON file is 3.1 MB.
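The per-year compilation described above can be sketched as follows; the `pm25_*.csv` file-name pattern is a hypothetical placeholder for the actual EPA download names:

```python
import glob
import pandas as pd

def compile_annual_files(pattern: str, out_path: str) -> pd.DataFrame:
    """Concatenate per-year CSV downloads into a single dataframe
    and write it out as one compiled CSV file."""
    frames = [pd.read_csv(f) for f in sorted(glob.glob(pattern))]
    full = pd.concat(frames, axis=0, ignore_index=True)
    full.to_csv(out_path, index=False)
    return full
```

The same helper works for both the emission and the admissions downloads, since it only assumes the yearly files share a common column layout.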
import warnings
import pandas as pd
import numpy as np
import json
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from bokeh.layouts import column
from bokeh.models import ColumnDataSource, RangeTool, BoxAnnotation, Label, LabelSet, HoverTool
from bokeh.plotting import figure, show, output_file, save
from bokeh.models.formatters import NumeralTickFormatter
from bokeh.palettes import Category20, Category10
warnings.filterwarnings('ignore')
df_pm25 = pd.read_csv("full_pm_emissions.csv")
df_n_avg = pd.read_csv("PM25National.csv")
df_admissions = pd.read_csv("full_admissions.csv")
data_boundaries = json.load(open("Borough Boundaries.geojson"))
We have specifically chosen to examine PM2.5 data, as these particles are assessed to be very harmful to human health (How Air Pollution Affects Our Health, 2023) and since NYC collects data on these particles in all boroughs. In addition, the NYC Mayor's Office of Climate and Environmental Justice has set a target to reduce disparities in ambient PM2.5 exposure within the city by 20 percent (Air Quality - Mayor's Office of Sustainability, n.d.).
Many years ago, NYC was one of the most polluted cities in the world due to industrial activities, transportation and population density. In the early 20th century, many factories and power plants in and around the city burned coal and oil, releasing large amounts of pollutants into the air (Dwyer, 2017). Our purpose with this data analysis is to inform the reader about the environmental developments that NYC has undergone over the last decade. We will focus on the positive environmental developments that the city has undergone and highlight the effects that the current initiatives (Varghese, 2022) taken by the Mayor's Office have had on the city's air quality and PM2.5 levels.
Of the 4 datasets used, the 2 containing emission and hospital admissions data had to be preprocessed and cleaned before they could be used for the analyses. As both datasets were extracted year by year, the first preprocessing step was to combine the yearly files into one large file covering the entire period from 2010 to 2020, for both emissions and admissions. Both datasets were at this point very large, so we delimited them to only the relevant information. First, both datasets had a column containing county information; since we were only looking at New York City, we filtered on the 5 counties corresponding to the 5 boroughs. Second, from the admissions data we were only interested in admissions of respiratory disease patients, which could be filtered via a column containing a diagnostic category.
For cleaning the data, we noticed that some of the emission measurements contained negative PM2.5 concentration values. Since a negative concentration is not physically possible (you cannot have a negative amount of fine particles in the air), these measurements were removed from the emission dataset. The negative values can be seen in the scatterplot below:
plt.figure(figsize=(15,8))
plt.scatter(df_pm25["Date"], df_pm25["Daily Mean PM2.5 Concentration"], alpha=0.8, s=1)
plt.axhline(y=0, color='r', linestyle='-')  # zero line: points below it are invalid measurements
plt.ylabel("Daily Mean PM2.5 Concentration")
plt.xticks([])
plt.show()
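The removal step itself reduces to a single boolean filter; a minimal sketch (the column name is taken from the dataset):

```python
import pandas as pd

def drop_negative_pm(df: pd.DataFrame,
                     col: str = "Daily Mean PM2.5 Concentration") -> pd.DataFrame:
    """Drop physically impossible negative concentration measurements."""
    return df[df[col] >= 0].reset_index(drop=True)
```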
# Read the raw data files.
#df1 = pd.read_csv('filename.csv', sep='<separator>')
# Get an overview of the size of the data
#df1["Column name"].value_counts()
# Delimit the dataset to the relevant information
#df1 = df1.loc[df1["Column name"] == "Column value"]
# Insert columns so that all the data has the same columns / format.
#df1.insert(loc=<column position>, column="Column name", value=arr0)
# Combine files into 1 dataframe
#df_full = pd.concat((df1, df2), axis=0)
# Remove unused columns
#df_full = df_full.drop("Column name", axis='columns')
# Sort by date
#df_full.sort_values(by='Date column', inplace=True)
# Reset the dataframe index
#df_full = df_full.reset_index(drop=True)
# Export the data to a new file.
#df_full.to_csv('processed_filename.csv')
Within the actual analysis we also did some preprocessing and cleaning to make the data ready for plotting. The column names were changed so that they were the same across the datasets (for example, the borough column was simply called 'Borough'). Since the raw data has a column for the New York State counties, we renamed 'Richmond' to 'Staten Island', 'Kings' to 'Brooklyn' and 'New York' to 'Manhattan', since these borough names are more commonly used within NYC; Queens and Bronx have the same county name as borough name. We also created columns containing just the year, month or season of the emission measure, which were used for the season plot and the national average comparison plot. The Borough column also needed to be renamed so that it matched the name of the 'boro_name' feature in the GeoJSON data. Lastly, data was grouped, pivoted, merged, averaged and rounded to create the input dataframes for the plots.
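The county-to-borough renames can be handled in one pass with a mapping dict (sketched on a toy series; whether 'New York' actually appears as a county value depends on the dataset export):

```python
import pandas as pd

# County names used by New York State data, mapped to borough names.
COUNTY_TO_BOROUGH = {
    "Richmond": "Staten Island",
    "Kings": "Brooklyn",
    "New York": "Manhattan",
}

def to_borough_names(counties: pd.Series) -> pd.Series:
    # Series.replace leaves values without a mapping entry untouched,
    # so Queens and Bronx pass through unchanged.
    return counties.replace(COUNTY_TO_BOROUGH)
```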
# Dropping leftover index columns
df_pm25 = df_pm25.drop(columns=['Unnamed: 0.1', 'Unnamed: 0'])
df_pm25['Date'] = pd.to_datetime(df_pm25['Date'])
# Extracting month and year from date column
df_pm25['Month'] = df_pm25['Date'].dt.month
df_pm25['Year'] = df_pm25['Date'].dt.year
# Renaming column values
df_admissions['Hospital County'] = df_admissions['Hospital County'].str.replace('Richmond', 'Staten Island')
df_admissions['Hospital County'] = df_admissions['Hospital County'].str.replace('Kings', 'Brooklyn')
# Grouping and averaging data
PM_avg = df_pm25.groupby(["Year", "Borough"])['Daily Mean PM2.5 Concentration'].mean()
PM_avg = PM_avg.reset_index()
# Renaming column headers
PM_avg.rename(columns={'Borough': 'boro_name',
                       'Daily Mean PM2.5 Concentration': 'Average PM2.5'}, inplace=True)
# Pivoting the data, so that 1 column is split into separate column headers
PM_avg = PM_avg.pivot(index='boro_name', columns='Year', values='Average PM2.5').reset_index()
# Grouping and averaging data
PM_avg2 = df_pm25.groupby(["Year"])['Daily Mean PM2.5 Concentration'].mean()
PM_avg2 = PM_avg2.reset_index()
# Renaming column header
PM_avg2.rename(columns = {'Daily Mean PM2.5 Concentration':'New York City'}, inplace = True)
# Merging 2 datasets (Left join), where rows from 1 dataset are inserted wherever column values match in the other dataset.
PM_avg2 = pd.merge(PM_avg2,df_n_avg, left_on='Year', right_on='Year', how='left')
# Dropping unused columns from merge
PM_avg2 = PM_avg2.drop(columns=['Number of Trend Sites', '10th Percentile', '90th Percentile'])
# Renaming column header
PM_avg2.rename(columns = {'Mean':'National average'}, inplace = True)
# Rounding column values
PM_avg2 = PM_avg2.round(3)
See "What is your dataset?" above for a precise description of the dataset metadata.
We could have delimited the datasets further, as there were still columns in the emission and admissions data which were not used for the analyses. However, we determined that the loading of the data was fast enough not to hinder our workflow. Overall, the data contained all the information we needed for the analyses and plots that we wanted to perform and create. We created an initial overview of the data, as seen in the plot below:
dates = np.array(df_pm25['Date'], dtype=np.datetime64)
source = ColumnDataSource(data=dict(date=dates, close=df_pm25['Daily Mean PM2.5 Concentration'], borough=df_pm25['Borough']))
# define a list of colors
colors = Category10[len(df_pm25['Borough'].unique())]
p = figure(height=300, width=800, tools="xpan", toolbar_location=None,
x_axis_type="datetime", x_axis_location="above",
background_fill_color="#efefef", x_range=(dates[1500], dates[2500]))
# group the data by borough and plot each group separately
for i, (borough, group) in enumerate(df_pm25.groupby('Borough')):
    group_dates = np.array(group['Date'], dtype=np.datetime64)
    group_source = ColumnDataSource(data=dict(date=group_dates, close=group['Daily Mean PM2.5 Concentration']))
    p.line('date', 'close', source=group_source, legend_label=borough, line_color=colors[i])
p.yaxis.axis_label = 'PM2.5 Concentration'
p.legend.location = "top_left"
p.legend.click_policy= "hide"
select = figure(title="Drag the middle and edges of the selection box to change the range above",
height=130, width=800, y_range=p.y_range,
x_axis_type="datetime", y_axis_type=None,
tools="", toolbar_location=None, background_fill_color="#efefef")
range_tool = RangeTool(x_range=p.x_range)
range_tool.overlay.fill_color = "navy"
range_tool.overlay.fill_alpha = 0.2
select.line('date', 'close', source=source)
select.ygrid.grid_line_color = None
select.add_tools(range_tool)
show(column(p, select))
This plot showed us that there was no remarkable difference between the borough measurements, and that the PM2.5 concentrations were mostly between 0 and 30 micrograms per cubic meter. At this point we could already see that the average measurements seemed to decrease over time, suggesting that the air pollution reduction initiatives implemented by the NYC government had been successful to some degree.
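The visual impression of a downward tendency can be backed by a quick slope estimate on the yearly means. A sketch with a hypothetical helper; on the real data one would pass the result of `df_pm25.groupby('Year')['Daily Mean PM2.5 Concentration'].mean()`:

```python
import numpy as np

def yearly_trend(years, means) -> float:
    """Least-squares slope of mean PM2.5 per year (units per year).
    A negative slope indicates decreasing pollution over the period."""
    slope, _intercept = np.polyfit(np.asarray(years, dtype=float),
                                   np.asarray(means, dtype=float), 1)
    return slope
```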
df_admissions.rename(columns = {'Hospital County':'Borough'}, inplace = True)
df_admissions.rename(columns = {'Discharge Year':'Year'}, inplace = True)
pd.set_option('display.max_columns', None)
df_grouped = df_admissions.groupby(["Year", "Borough"]).count().reset_index()
df_grouped = df_grouped[["Year", "Borough", "Unnamed: 0"]]
df_grouped.rename(columns = {'Unnamed: 0':'Count'}, inplace = True)
df_grouped = df_grouped.pivot(index='Year', columns='Borough', values='Count').reset_index()
df_grouped.rename(columns = {'Kings':'Brooklyn'}, inplace = True)
df_grouped.rename(columns = {'Richmond':'Staten Island'}, inplace = True)
fig = px.bar(df_grouped, x='Year', y=['Manhattan', 'Brooklyn', 'Bronx', 'Queens', 'Staten Island'],
range_y=[0,43000],
color_discrete_sequence=["red", "blue", "green", "yellow", "orange"],
)
fig.update_layout(barmode='group')
fig.update_layout(legend=dict(
yanchor="top",
y=0.99,
xanchor="left",
x=0.01,
title='Borough'
), yaxis_title='Count of respiratory admissions')
fig.update_traces(hovertemplate = None,
hoverinfo = "skip")
fig.show()
From this chart, we could see that Manhattan has the highest number of admissions of the 5 boroughs, which could be expected since it is the most populated. In general, the number of respiratory disease admissions follows the population within each borough. Also, the chart shows a slight downward trend, which means that something is affecting the admission rates in each borough. Finally, we can see that 2020 has significantly higher admissions, as expected considering this was the year in which Covid-19 started to spread within the United States.
The idea to analyze PM2.5 air particles and investigate the correlation between these particles and the number of hospitalizations for respiratory diseases came from a group member's uncle who has lived in NYC for more than 20 years. He described some of the negative experiences he has had with the city's air pollution, but also the positive environmental development that NYC has undergone in recent years. We therefore felt that this could be interesting to investigate more in depth and analytically.
The data analysis is thus based on describing the overall development of PM2.5 in the years from 2010 to 2020, with a special focus on the overall trends in NYC air pollution and on which seasons and boroughs experience the most PM2.5 pollution. Additionally, we examine which boroughs experience the most hospital admissions caused by respiratory diseases, whether there could be a correlation between air pollution and hospital admissions, and which age groups are particularly vulnerable. Through our data analysis, we have learned that air pollution in NYC has improved significantly in recent years. This can particularly be seen in the downward air pollution trend, and likewise the number of hospitalizations for respiratory diseases decreased significantly by 2020. The reasons for this evolution can be explained by the various environmental initiatives introduced by the Mayor's Office in the last decade. Although we had an idea before the analysis that PM2.5 air pollution was decreasing, we were surprised by how much it had actually decreased, particularly compared to the US average air pollution.
To tell the intended story from our data analysis, we used the Magazine Style genre to communicate our findings. This genre suited the narrative, as it enabled us to interpret the visual graphs in detailed writing, guiding the reader to understand why the story was interesting to investigate. The use of the martini glass structure made it possible to convey the information in a structured approach, wherein the general overview was initially communicated, followed by deep-dives into the areas of interest.
For the visual narrative, the following approaches were used within the 3 categories:
Visual Structuring: Consistent Visual Platform
The website that tells the story has a consistent visual structure, reminiscent of other online news or magazine platforms, with text columns containing in-line visuals to accompany the text. This use of a familiar and consistent visual structure lets the reader understand what the intended genre of the story is, even before they start reading.
Highlighting: Feature Distinction
In the text, and within the visual graphs created for the story, we intend to highlight and interpret distinct features of the data that is being analyzed. This will help the reader to understand what the key findings for each visual are.
Transition Guidance: Object continuity
Due to the linearity of the storytelling, the transitions between story elements are mostly communicated through text references alone. Otherwise, the data used for each graph considers different elements of the same few datasets, which gives some analytical continuity between the sections of the story.
For the narrative structure, the following approaches were used within the 3 categories:
Ordering: Linear, User directed path
As we are using the magazine style, the story is told primarily in a linear flow. The reader is intended to read the story from beginning to end, as the visuals and accompanying text have been strung together into a story that builds on previous points and conclusions. However, the website allows some user-directed input, as we have a dedicated tab with additional explanatory information on, for example, the harmful effects of PM2.5 exposure.
Interactivity: Hover Highlighting / Details, Filtering / Selection / Search
Using the website to host the magazine story, it was possible to integrate user input into the visuals of the story. Thereby, the reader could interact with the plots by hovering for more information, or filtering for specific years in the charts. This interactivity made it possible for the reader to play around with the presented information, increasing user engagement compared to a more static visualization.
Messaging: Introductory Text, Captions / Headlines
Finally, using introductory text to graphs and sections of the story will help clarify to the reader what the intended messages of specific story points are. The use of captions and headlines should reinforce this perception, as the reader is able to quickly understand the purpose of sections through their given headline.
This plot is a line chart which is made with the use of the Bokeh library to create a “range plot” (the interactive slide plot in the bottom). Furthermore, the plot includes horizontally placed “boxes” which mark specific areas of pollution levels provided by the World Health Organization (WHO).
This line chart shows the development of PM2.5 concentration in NYC, which seemed an obvious inclusion as it is an essential part of the story we want to tell. We chose a line chart because of the very large number of observations, one for each day across ten years; a line chart is the easiest way to comprehend the trend in that many observations. Furthermore, we made the plot interactive so the reader can examine specific time periods or spikes in the data. Finally, we added the coloured boxes using the colours generally associated with high pollution (red) and low pollution (green).
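The coloured concentration bands can be expressed as a small lookup function; the thresholds below mirror the BoxAnnotation boundaries used in the plot code, not official WHO guideline values:

```python
def pm25_band(value: float) -> str:
    """Map a PM2.5 concentration (ug/m3) to the colour band in the plot."""
    bands = [(5, "green"), (10, "light green"),
             (15, "yellow"), (25, "orange")]
    for upper, name in bands:
        if value <= upper:
            return name
    return "red"  # everything above 25 ug/m3
```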
groups = df_pm25.groupby('Date')["Daily Mean PM2.5 Concentration"].mean()
groups = pd.DataFrame(groups).reset_index()
dates = np.array(groups['Date'], dtype=np.datetime64)
source1 = ColumnDataSource(data=dict(date=dates, close=groups['Daily Mean PM2.5 Concentration']))
p1 = figure(height=300, width=800, tools="xpan", toolbar_location=None,
x_axis_type="datetime", x_axis_location="above",
background_fill_color="#efefef", x_range=(dates[1500], dates[2500]))
# Define the green box
green_box = BoxAnnotation(bottom=-5, top=5, fill_color='green', fill_alpha=0.4)
p1.add_layout(green_box)
light_green_box = BoxAnnotation(bottom=5, top=10, fill_color='green', fill_alpha=0.3)
p1.add_layout(light_green_box)
yellow_box = BoxAnnotation(bottom=10, top=15, fill_color='yellow', fill_alpha=0.3)
p1.add_layout(yellow_box)
orange_box = BoxAnnotation(bottom=15, top=25, fill_color='orange', fill_alpha=0.3)
p1.add_layout(orange_box)
red_box = BoxAnnotation(bottom=25, top=40, fill_color='red', fill_alpha=0.3)
p1.add_layout(red_box)
# Plot line and save renderer
line_renderer = p1.line('date', 'close', source=source1, line_color='blue')
p1.yaxis.axis_label = 'PM2.5 Concentration'
# Move line renderer to end of renderer list to bring it to front
p1.renderers.append(line_renderer)
select = figure(title="Drag the middle and edges of the selection box to change the range above",
height=130, width=800, y_range=p1.y_range,
x_axis_type="datetime", y_axis_type=None,
tools="", toolbar_location=None, background_fill_color="#efefef")
range_tool = RangeTool(x_range=p1.x_range)
range_tool.overlay.fill_color = "navy"
range_tool.overlay.fill_alpha = 0.2
select.line('date', 'close', source=source1)
select.ygrid.grid_line_color = None
select.add_tools(range_tool)
show(column(p1, select))
#output_file("bokeh_linechart.html")
#save(p1)
To visualize the variation in each season we have chosen to make a boxplot with the Plotly library. For this visual we added an interactive slider that can be played as an animation, showing the development of PM2.5 concentration levels for each season over the last decade.
We wanted to gain a better understanding of which contributors could be the main reasons for high PM2.5 levels, for which we thought it would be useful to see the concentration levels for each season, as these could indicate underlying contributors to increased pollution. We found the boxplot to be the most insightful, as it made a more in-depth analysis possible by showing the distribution of the observations, compared to a bar chart or similar.
def get_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Fall'
#Apply function to dataset to extract seasons:
df_pm25['Season'] = df_pm25['Month'].apply(get_season)
color_map = {'Winter': 'blue', 'Spring': 'green', 'Summer': 'yellow', 'Fall': 'orange'}
fig = px.box(df_pm25, x='Season', y='Daily Mean PM2.5 Concentration', color='Season', animation_frame='Year',
color_discrete_map=color_map,
range_y=[0, df_pm25['Daily Mean PM2.5 Concentration'].max() + 10],
labels={'Daily Mean PM2.5 Concentration': 'PM2.5 Concentration (ug/m3)'},
title='<b>PM2.5 Concentration in NYC by Season and Year</b>')
# Add slider
fig.layout.updatemenus[0].buttons[0].args[1]['frame']['duration'] = 1000
fig.layout.sliders[0].pad = {'t': 50}
fig.layout.sliders[0].len = 0.9
fig.layout.sliders[0].currentvalue.prefix = 'Year: '
fig.layout.sliders[0].currentvalue.font.size = 18
# Show plot
fig.show()
#fig.write_html("boxplot.html")
We have decided to implement a choropleth map visualizing the PM2.5 air pollution in NYC in the period from 2010 to 2020, using the Plotly graph objects package (go). We specifically chose this plot since we wanted to visualize the air pollution evolution in the different NYC boroughs and highlight which boroughs are most affected by PM2.5. This plot is right for our story since it can show the evolution over time (via the dropdown) and thereby visualize the positive evolution of PM2.5 pollution in NYC. Furthermore, visualizing PM2.5 pollution geographically makes it possible to look into the underlying causes of why some boroughs are more affected than others. The code for this plot can be seen in the cell below:
df = PM_avg
geojson = data_boundaries
fig = go.Figure(go.Choroplethmapbox(geojson=geojson,
locations=df['boro_name'],
z=df[2010],
colorscale='temps',
featureidkey="properties.boro_name",
zmin=0,
zmax=11,
marker_opacity=0.5,
marker_line_width=0.1
)
)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.update_layout(coloraxis_colorscale='temps',
mapbox=dict(style='carto-positron',
zoom=9.3, center = {"lat": 40.7, "lon": -73.984},
))
# One dropdown button per year, each updating the choropleth's z values
buttons = [dict(method='update',
                label=str(year),
                args=[{'z': [df[year]]},
                      {'coloraxis.colorscale': 'temps'}])
           for year in range(2010, 2021)]
fig.update_layout(updatemenus=[dict(active=0, buttons=buttons)])
fig.show()
#fig.write_html("choroplethmap.html")
Using the Plotly Express package we have chosen to visualize a bar chart containing the aggregated PM2.5 levels for both NYC and the national average over a ten-year period. This bar chart fits our story well, as it clearly visualizes how the PM2.5 pollution has decreased over time and how NYC's PM2.5 level used to be higher than the national average but has been below it since 2013. This development supports our argument that the environmental initiatives introduced by the Mayor's Office of Climate and Environmental Justice have actually positively affected the air pollution in NYC. The code for the bar chart can be seen in the cell below:
fig = px.bar(PM_avg2, x='Year', y=['New York City', 'National average'],
range_y=[0,15],
text_auto=True,
color_discrete_sequence=["indianred", "lightsalmon"])
fig.update_layout(barmode='group')
fig.update_layout(legend=dict(
yanchor="top",
y=0.99,
xanchor="left",
x=0.01,
title='PM2.5 Measures'
), yaxis_title='PM2.5 Concentration')
fig.update_traces(hovertemplate = None,
hoverinfo = "skip")
fig.show()
#fig.write_html("barchart_comparison.html")
The plot is a stacked bar chart made with the Bokeh library, where the hover tool makes it interactive: hovering the mouse over a section of the plot provides information on that section.
As we wanted insight into how the decreasing PM2.5 concentration levels affect admissions, we found this plot to provide a manageable overview with the right amount of information to analyse whether there is a correlation. We grouped the data by borough, as we had already analysed pollution per borough, and found it interesting to explore whether there is a direct correlation between the decrease in pollution in a borough and the number of admissions in the same borough.
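Once both datasets are aggregated to yearly values per borough, the correlation question can be checked numerically. A sketch under the assumption of two input dataframes with 'Year' and 'Borough' columns plus value columns named 'PM2.5' and 'Admissions' (these names are illustrative, not the dataset's own):

```python
import pandas as pd

def pm_admission_correlation(pm_yearly: pd.DataFrame,
                             adm_yearly: pd.DataFrame) -> pd.Series:
    """Pearson correlation between yearly mean PM2.5 and yearly
    respiratory admissions, computed separately for each borough."""
    merged = pm_yearly.merge(adm_yearly, on=["Year", "Borough"])
    return merged.groupby("Borough").apply(
        lambda g: g["PM2.5"].corr(g["Admissions"]))
```

A value near +1 for a borough would mean that years with lower PM2.5 also saw fewer admissions there.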
grouped2 = df_admissions.groupby(['Year', 'Borough']).size().reset_index(name='Admissions')
# Pivot the table so that the categories become the columns and the counts become the values
pivoted = grouped2.pivot(index='Year', columns='Borough', values='Admissions')
# Reset the index to make year a regular column
pivoted = pivoted.reset_index()
TOOLS = "save,pan,box_zoom,reset,wheel_zoom,tap"
source2 = ColumnDataSource(data=pivoted)
p2 = figure(width=800, title="Number of admissions by Year", toolbar_location='above', tools=TOOLS)
n_counties = df_admissions["Borough"].nunique()
colors = Category20[n_counties]
color_map = {county: colors[i % n_counties] for i, county in enumerate(list(pivoted.columns[1:]))}
counties = list(pivoted.columns[1:])
p2.vbar_stack(counties, x='Year', width=0.9, color=[color_map[county] for county in counties],
source=source2, legend_label=counties)
p2.y_range.start = 0
p2.y_range.end = 110000
p2.x_range.range_padding = 0.1
p2.xgrid.grid_line_color = None
p2.axis.minor_tick_line_color = None
p2.outline_line_color = None
p2.xaxis.axis_label_text_font_style = 'bold'
p2.yaxis.axis_label = 'Admissions'
p2.yaxis.axis_label_text_font_style = 'bold'
p2.title.text_font_size = '16pt'
p2.title.text_font_style = 'bold'
p2.yaxis.formatter = NumeralTickFormatter(format="0,0")
p2.xaxis.axis_label = 'Year'
p2.legend.location = "top_left"
p2.legend.orientation = "horizontal"
#tooltip
tooltips = [
("Year", "@{Year}"),
("County", "$name"),
("Admissions", "@$name")
]
#hovertool
p2.add_tools(HoverTool(tooltips=tooltips))
show(p2)
#output_file("bokeh_stacked.html")
#save(p2)
We have chosen to develop a bar chart with the Plotly library to incorporate a slider and "animation" of the development in admissions over the ten-year period. The bucket sizes are determined by the age intervals given in the dataset.
We found this plot type right for the story, as we wanted to take a further step into the admissions dataset and provide a better understanding of which segments were most affected by the decreased amount of air pollutants. The bar chart was chosen as it provides an easy overview, and we did not want to put too much information in the plot, as that would make it difficult to comprehend everything when the reader uses the play button to watch the development over time.
grouped3 = df_admissions.groupby(['Age Group','Year']).size().reset_index(name='Admissions')
# create a histogram plot with a slider
fig = px.histogram(grouped3, x="Age Group", y="Admissions",
color="Age Group", animation_frame="Year",
range_y=[0, max(grouped3["Admissions"])*1.1])
# add y-labels, x-labels, and a title
fig.update_layout(
xaxis_title="Age group",
yaxis_title="Admissions",
title="<b>Total age group hospital admissions by Year"
)
# show the plot
fig.show()
#fig.write_html("barchart_age.html")
For the creation of our webpage, story and visualizations, we are quite satisfied with the overall result. The following highlights a couple of the things that worked best and what we see as our successes:
Visualizations: We are all quite happy with the visuals we managed to create and found that we learned a lot by incorporating various libraries and formats (Bokeh, Plotly, GeoJSON, etc.). In doing so, we feel we succeeded in making the visuals interactive and supportive of the story we wanted to tell, through sliders, buttons, etc.
Storytelling & Analysis: We found the storytelling quite interesting and were positively surprised by the amount of existing qualitative information available specifically about NYC, which made it possible to show and tell our story with city-specific information. We think the story and our data analysis have turned out well, with a clear "red thread" that takes the reader through our findings.
Conclusion: Somewhat related to the storytelling, we still want to highlight that the data made it possible to reach a conclusion: the concentration levels of PM2.5 actually decreased over time, showing that the focus NYC has had on air pollution has had an effect.
When working with data on a final project, there will always be parts that could be improved, as highlighted in the following:
Preliminary data handling: Unfortunately we ran into a few challenges during the making of our plots and data analysis, as we started with a different dataset than the one we ended up using. Due to an incomplete preliminary data analysis, we only discovered after working with the data for a couple of days that it was insufficient because of missing data, meaning a lot of time was spent on a dataset we could not use.
Admission data: We would have wished for admissions data that included a specific admission date rather than only a year, as this would have made it possible to plot the correlation between daily admissions and daily PM2.5 observations and thereby show the relationship between the two factors in more detail. However, even though the dataset included everything from the cost of the admission to the ethnicity of the patient, we were not able to find any dataset containing the day of admission.
More datasets: We initially wanted to include data for other pollutants such as ozone and SO2, as it would have been interesting to see how these levels developed compared to PM2.5. We ended up not including them, as we could not find a dataset covering multiple pollutants. We were able to find data on the other pollutants individually, but there were several missing periods of observations for some of the boroughs, which could have skewed the analysis and led to conclusions based on misleading data.
What if we had more time: If we were to work with this topic for longer, we all agree that finding more data would be the right direction. For example, we could spend more time finding datasets with complete data for other pollutants, or data for other European/US cities, to compare the initiatives each city has made and determine which have had the most impact, creating a comprehensive data-driven guide to decreasing air pollution.
| Section/Plot | Andreas | Magnus | Simon |
|---|---|---|---|
| Data Selection | 33% | 33% | 33% |
| Data cleaning and preliminary analysis | 25% | 25% | 50% |
| Setup of Webpage | 25% | 50% | 25% |
| Plot: Bokeh Line Chart | 100% | 0% | 0% |
| Plot: Plotly Boxplot | 0% | 100% | 0% |
| Plot: JSON Choropleth | 0% | 0% | 100% |
| Plot: Bar Chart(national avg) | 0% | 0% | 100% |
| Plot: Bokeh Stacked Bar Chart | 0% | 100% | 0% |
| Plot: Plotly Bar Chart (Age adm.) | 100% | 0% | 0% |
| Story | 50% | 25% | 25% |
| Literature review | 40% | 30% | 30% |
| Explanatory Notebook | 20% | 40% | 40% |
See "References" tab on website.